ndcg_score — Normalized Discounted Cumulative Gain (NDCG)#

NDCG is a ranking metric: it evaluates how well a model orders items so that the most relevant ones appear near the top.

Typical use cases:

  • web search / information retrieval

  • recommender systems (ranked lists)

  • learning-to-rank models

In this notebook you will:

  • build intuition for DCG (gain + position discount) and NDCG (normalization)

  • work through a concrete, hand-computed example with plots

  • implement dcg@k / ndcg@k from scratch in NumPy (and compare against scikit-learn, which defaults to linear gain)

  • see how NDCG is used when training a simple (linear) ranking model

Quick import#

from sklearn.metrics import ndcg_score
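A minimal call, with illustrative labels and scores. Inputs are 2D (one row per query), and only the order of the scores matters; note that scikit-learn's default gain is linear (the relevance grade itself):

```python
from sklearn.metrics import ndcg_score

# One query, three documents: higher y_true = more relevant.
y_true = [[3, 2, 1]]             # graded relevance labels
y_score = [[0.3, 0.2, 0.1]]      # model scores; here they match the labels' order

print(ndcg_score(y_true, y_score))              # perfect ranking -> 1.0
print(ndcg_score(y_true, [[0.1, 0.2, 0.3]]))    # fully reversed ranking -> ~0.79
```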

1) Ranking is not classification#

For a query \(q\) (a search query, a user in a recommender system, …) you usually have many candidate items:

  • true (graded) relevance labels: \(y_{q1}, \dots, y_{qn}\), with \(y_{qi} \ge 0\)

  • model scores: \(s_{q1}, \dots, s_{qn}\) (any real numbers)

Sorting scores induces a ranking (a permutation) \(\pi_q\) such that:

\[ s_{q\pi_q(1)} \ge s_{q\pi_q(2)} \ge \dots \ge s_{q\pi_q(n)} \]

The top positions matter most (users rarely scroll), and labels can be graded (e.g., 0 = irrelevant, 3 = perfect).

NDCG answers: “How good is my ranked list compared to the best possible ranking?”
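The induced permutation \(\pi_q\) is just an argsort of the scores. A tiny sketch with illustrative values:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.5])   # s_q1, s_q2, s_q3
pi = np.argsort(-scores)             # permutation: highest-scored item first
print(pi)                            # -> [1 2 0]

labels = np.array([0, 3, 1])         # graded relevance for the same items
print(labels[pi])                    # relevance read off in ranked order -> [3 1 0]
```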

2) DCG: reward relevant items early#

Let \(\mathrm{rel}_i\) be the true relevance of the item placed at rank \(i\).

A common choice is:

  • gain: \(g(\mathrm{rel}) = 2^{\mathrm{rel}} - 1\) (amplifies high relevance)

  • discount: \(d(i) = \frac{1}{\log_2(i + 1)}\) (penalizes lower ranks)

Then the Discounted Cumulative Gain at cutoff \(k\) is:

\[ \mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)} \]

Notes:

  • Only the order of scores matters (any strictly monotonic transform of scores leaves DCG/NDCG unchanged, ignoring ties).

  • Some sources use linear gain \(g(\mathrm{rel}) = \mathrm{rel}\) instead.
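As a hand check of the formula: with relevances (3, 0, 2) at ranks 1–3, exponential gain gives \(7 \cdot 1 + 0 \cdot 0.631 + 3 \cdot 0.5 = 8.5\):

```python
import numpy as np

rels = np.array([3, 0, 2])                  # relevance at ranks 1, 2, 3
gains = 2.0**rels - 1                       # -> [7, 0, 3]
discounts = 1.0 / np.log2(np.arange(2, 5))  # 1/log2(2), 1/log2(3), 1/log2(4)
dcg3 = (gains * discounts).sum()
print(dcg3)                                 # 7*1 + 0 + 3*0.5 = 8.5
```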

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.metrics import ndcg_score

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)
# Visualize the discount curve and the (exponential) gain function
k_max = 10
positions = np.arange(1, k_max + 1)
discounts = 1.0 / np.log2(positions + 1)

fig = px.line(
    x=positions,
    y=discounts,
    markers=True,
    title="Position discount used by DCG: 1 / log2(rank + 1)",
    labels={"x": "rank (1 = top result)", "y": "discount weight"},
)
fig.show()

rels = np.arange(0, 5)
gains = 2**rels - 1

fig = px.bar(
    x=rels,
    y=gains,
    title="Common gain function: g(rel) = 2^rel - 1 (graded relevance)",
    labels={"x": "relevance grade (rel)", "y": "gain"},
)
fig.show()

3) IDCG and NDCG: make DCG comparable across queries#

Raw DCG depends on the query’s “label mass” (how many relevant items exist, and how relevant they are).

So we normalize by the best possible DCG for that query:

  • \(\mathrm{IDCG}@k\) (Ideal DCG@k): compute DCG@k after sorting items by true relevance (best-first)

  • \(\mathrm{NDCG}@k\): scale to \([0, 1]\)

\[ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k} \]

Edge case: if \(\mathrm{IDCG}@k = 0\) (all items have relevance 0), we define \(\mathrm{NDCG}@k = 0\).

With multiple queries, you typically report the mean NDCG@k over queries.
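scikit-learn's `ndcg_score` already performs this averaging when given 2D inputs; with illustrative values, we can check it matches the mean of per-query calls:

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 2, 0, 1],
                   [0, 1, 2, 0]])
y_score = np.array([[0.1, 0.4, 0.3, 0.2],
                    [0.9, 0.2, 0.5, 0.1]])

# NDCG for each query separately, then the batched call over both rows.
per_query = [ndcg_score(y_true[q:q + 1], y_score[q:q + 1]) for q in range(2)]
mean_all = ndcg_score(y_true, y_score)
print(per_query, mean_all)
```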

# Worked example: one query with 8 candidate items
doc_ids = np.array(list("ABCDEFGH"))
y_true = np.array([3, 2, 3, 0, 1, 2, 0, 1])
y_score = np.array([0.60, 0.20, 0.80, 0.40, 0.10, 0.30, 0.05, 0.70])
k = 5

order_pred = np.argsort(-y_score, kind="mergesort")
order_ideal = np.argsort(-y_true, kind="mergesort")

rank = np.arange(1, len(doc_ids) + 1)
discount = 1.0 / np.log2(rank + 1)

def gain_exp(rel):
    return 2**rel - 1

rel_pred = y_true[order_pred]
gain_pred = gain_exp(rel_pred)
contrib_pred = gain_pred * discount
dcg_k = contrib_pred[:k].sum()

rel_ideal = y_true[order_ideal]
gain_ideal = gain_exp(rel_ideal)
contrib_ideal = gain_ideal * discount
idcg_k = contrib_ideal[:k].sum()

ndcg_k = dcg_k / idcg_k if idcg_k > 0 else 0.0

print(f"Predicted order: {''.join(doc_ids[order_pred])}")
print(f"Ideal order:     {''.join(doc_ids[order_ideal])}")
print(f"DCG@{k}:  {dcg_k:.4f}")
print(f"IDCG@{k}: {idcg_k:.4f}")
print(f"NDCG@{k}: {ndcg_k:.4f}")

# Note: sklearn's ndcg_score uses linear gain (rel itself), not 2^rel - 1,
# so its value differs slightly from the exponential-gain numbers above
sk_ndcg = ndcg_score(y_true[None, :], y_score[None, :], k=k)
print(f"sklearn ndcg_score@{k}: {sk_ndcg:.4f}")

# Any strictly monotonic transform of scores keeps the ranking the same
y_score_scaled = 10 * y_score + 5
sk_scaled = ndcg_score(y_true[None, :], y_score_scaled[None, :], k=k)
print(f"sklearn ndcg_score@{k} after scaling scores: {sk_scaled:.4f}")
Predicted order: CHADFBEG
Ideal order:     ACBFEHDG
DCG@5:  12.2915
IDCG@5: 14.5954
NDCG@5: 0.8421
sklearn ndcg_score@5: 0.8269
sklearn ndcg_score@5 after scaling scores: 0.8269
# Where does DCG@k come from? Break it into per-rank contributions.

docs_pred = doc_ids[order_pred]
scores_pred = y_score[order_pred]
rels_pred = y_true[order_pred]
gains_pred = gain_exp(rels_pred)
contribs_pred = gains_pred * discount

docs_ideal = doc_ids[order_ideal]
rels_ideal = y_true[order_ideal]
gains_ideal = gain_exp(rels_ideal)
contribs_ideal = gains_ideal * discount

rows = slice(0, k)

fig = go.Figure(
    data=[
        go.Table(
            header=dict(
                values=["rank", "doc", "y_true", "score", "gain", "discount", "gain*discount"],
                align="left",
            ),
            cells=dict(
                values=[
                    (np.arange(1, len(doc_ids) + 1)[rows]).tolist(),
                    docs_pred[rows].tolist(),
                    rels_pred[rows].tolist(),
                    np.round(scores_pred[rows], 3).tolist(),
                    gains_pred[rows].tolist(),
                    np.round(discount[rows], 3).tolist(),
                    np.round(contribs_pred[rows], 3).tolist(),
                ],
                align="left",
            ),
        )
    ]
)
fig.update_layout(title=f"Top-{k} contributions to DCG@{k} (predicted ranking)")
fig.show()

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=np.arange(1, k + 1),
        y=contribs_pred[:k],
        name="predicted",
    )
)
fig.add_trace(
    go.Bar(
        x=np.arange(1, k + 1),
        y=contribs_ideal[:k],
        name="ideal",
    )
)
fig.update_layout(
    title=f"Per-rank contributions: DCG@{k} vs IDCG@{k}",
    barmode="group",
    xaxis_title="rank",
    yaxis_title="gain(rel) / log2(rank+1)",
)
fig.show()
# NDCG as a function of k (for this one query)
k_list = np.arange(1, len(doc_ids) + 1)
ndcgs = []
for kk in k_list:
    dcg = contribs_pred[:kk].sum()
    idcg = contribs_ideal[:kk].sum()
    ndcgs.append(dcg / idcg if idcg > 0 else 0.0)
ndcgs = np.array(ndcgs)

fig = px.line(
    x=k_list,
    y=ndcgs,
    markers=True,
    title="NDCG@k for a single query",
    labels={"x": "k (cutoff)", "y": "NDCG@k"},
)
fig.update_yaxes(range=[0, 1.05])
fig.show()

4) From-scratch NumPy implementation#

scikit-learn expects:

  • y_true: array of shape (n_queries, n_docs) with non-negative relevance grades

  • y_score: array of shape (n_queries, n_docs) with arbitrary real-valued model scores

We’ll implement:

  • dcg_at_k_numpy(...) → DCG per query

  • ndcg_at_k_numpy(...) → NDCG per query

  • mean_ndcg_at_k_numpy(...) → average NDCG across queries

def _as_2d_same_shape(y_true, y_score):
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)

    if y_true.shape != y_score.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_score{y_score.shape}")

    if y_true.ndim == 1:
        y_true = y_true[None, :]
        y_score = y_score[None, :]
    elif y_true.ndim != 2:
        raise ValueError("expected 1D or 2D arrays")

    if np.any(y_true < 0):
        raise ValueError("y_true must be non-negative for DCG/NDCG")

    return y_true, y_score


def _gain(relevance, scheme="exponential"):
    relevance = np.asarray(relevance, dtype=float)
    if scheme == "exponential":
        return np.power(2.0, relevance) - 1.0
    if scheme == "linear":
        return relevance
    raise ValueError("scheme must be 'exponential' or 'linear'")


def dcg_at_k_numpy(y_true, y_score, k=None, gain_scheme="exponential"):
    y_true, y_score = _as_2d_same_shape(y_true, y_score)
    n_queries, n_docs = y_true.shape

    if k is None:
        k = n_docs
    k = int(k)
    if k <= 0:
        raise ValueError("k must be >= 1")
    k = min(k, n_docs)

    order = np.argsort(-y_score, axis=1, kind="mergesort")
    y_sorted = np.take_along_axis(y_true, order, axis=1)

    gains = _gain(y_sorted[:, :k], scheme=gain_scheme)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    return np.sum(gains * discounts[None, :], axis=1)


def idcg_at_k_numpy(y_true, k=None, gain_scheme="exponential"):
    y_true = np.asarray(y_true, dtype=float)
    if y_true.ndim == 1:
        y_true = y_true[None, :]
    if y_true.ndim != 2:
        raise ValueError("expected 1D or 2D arrays")

    n_queries, n_docs = y_true.shape
    if k is None:
        k = n_docs
    k = int(k)
    if k <= 0:
        raise ValueError("k must be >= 1")
    k = min(k, n_docs)

    ideal_order = np.argsort(-y_true, axis=1, kind="mergesort")
    y_ideal = np.take_along_axis(y_true, ideal_order, axis=1)

    gains = _gain(y_ideal[:, :k], scheme=gain_scheme)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    return np.sum(gains * discounts[None, :], axis=1)


def ndcg_at_k_numpy(y_true, y_score, k=None, gain_scheme="exponential"):
    y_true, y_score = _as_2d_same_shape(y_true, y_score)
    dcg = dcg_at_k_numpy(y_true, y_score, k=k, gain_scheme=gain_scheme)
    idcg = idcg_at_k_numpy(y_true, k=k, gain_scheme=gain_scheme)
    return np.divide(dcg, idcg, out=np.zeros_like(dcg), where=idcg > 0)


def mean_ndcg_at_k_numpy(y_true, y_score, k=None, gain_scheme="exponential"):
    return float(ndcg_at_k_numpy(y_true, y_score, k=k, gain_scheme=gain_scheme).mean())


def dcg_at_k_expected_ties_1d(y_true, y_score, k=None, gain_scheme="exponential"):
    """Expected DCG@k under uniform random tie-breaking for equal scores (1 query).

    Useful for understanding how ties can change DCG/NDCG.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.ndim != 1 or y_score.ndim != 1:
        raise ValueError("expected 1D arrays")
    if y_true.shape != y_score.shape:
        raise ValueError("shape mismatch")
    if np.any(y_true < 0):
        raise ValueError("y_true must be non-negative")

    n = y_true.size
    if k is None:
        k = n
    k = int(k)
    if k <= 0:
        raise ValueError("k must be >= 1")
    k = min(k, n)

    order = np.argsort(-y_score, kind="mergesort")
    y_true_sorted = y_true[order]
    y_score_sorted = y_score[order]

    gains_sorted = _gain(y_true_sorted, scheme=gain_scheme)
    discounts = 1.0 / np.log2(np.arange(2, n + 2))

    dcg = 0.0
    start = 0
    while start < k:
        end = start + 1
        while end < n and y_score_sorted[end] == y_score_sorted[start]:
            end += 1
        group_size = end - start
        sum_discounts_in_top_k = discounts[start : min(end, k)].sum()
        dcg += gains_sorted[start:end].sum() * (sum_discounts_in_top_k / group_size)
        start = end
    return float(dcg)
# Compare against scikit-learn. ndcg_score uses linear gain while our default is
# exponential, so expect a systematic gap; pass gain_scheme="linear" to match exactly.
# (Ties are rare here due to continuous scores.)
n_queries, n_docs = 30, 40
y_true_rand = rng.integers(0, 4, size=(n_queries, n_docs))
y_score_rand = rng.normal(size=(n_queries, n_docs))

for k in [1, 3, 5, 10, None]:
    sk = ndcg_score(y_true_rand, y_score_rand, k=k)
    np_impl = mean_ndcg_at_k_numpy(y_true_rand, y_score_rand, k=k)
    print(f"k={str(k):>4} | sklearn={sk:.6f} | numpy={np_impl:.6f} | abs diff={abs(sk-np_impl):.2e}")
k=   1 | sklearn=0.477778 | numpy=0.371429 | abs diff=1.06e-01
k=   3 | sklearn=0.463501 | numpy=0.357834 | abs diff=1.06e-01
k=   5 | sklearn=0.487567 | numpy=0.378214 | abs diff=1.09e-01
k=  10 | sklearn=0.495100 | numpy=0.391870 | abs diff=1.03e-01
k=None | sklearn=0.801454 | numpy=0.736041 | abs diff=6.54e-02

5) Ties and other pitfalls#

Ties in y_score#

If multiple items share exactly the same score, the ranking is not uniquely defined. Different tie-breaking rules can lead to different DCG/NDCG values.

scikit-learn’s ndcg_score implements a tie-aware variant (the expected value under uniformly random tie-breaking) when ignore_ties=False, which is the default.

All-zero relevance#

If all relevances are zero for a query, then \(\mathrm{IDCG}@k = 0\) and NDCG is defined as 0.

Label scaling#

With exponential gains \(2^{\mathrm{rel}} - 1\), going from relevance 2 to 3 matters more than 0 to 1. That’s often what you want in IR (highly relevant docs should dominate), but you should choose grades deliberately.
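Concretely, with \(2^{\mathrm{rel}} - 1\) the gain jump from grade 2 to grade 3 is four times the jump from grade 0 to grade 1:

```python
gains = [2**rel - 1 for rel in range(4)]               # grades 0..3 -> [0, 1, 3, 7]
print(gains)
print(gains[3] - gains[2], "vs", gains[1] - gains[0])  # jump 2->3 is 4 vs 0->1 is 1
```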

# Demonstrate how ties can create a range of possible NDCG values
y_true_tie = np.array([3, 2, 1, 0, 0])
y_score_tie = np.array([0.9, 0.8, 0.8, 0.8, 0.1])  # tie among 3 items
k = 5

# Stable tie-breaking (argsort with kind="mergesort")
ndcg_simple = ndcg_at_k_numpy(y_true_tie, y_score_tie, k=k)[0]

# Best/worst tie-breaking inside the tied group (same primary score ordering)
order_best = np.lexsort((-y_true_tie, -y_score_tie))
order_worst = np.lexsort((y_true_tie, -y_score_tie))

def ndcg_from_order(order):
    rel_sorted = y_true_tie[order]
    discounts = 1.0 / np.log2(np.arange(2, rel_sorted.size + 2))
    contribs = (2**rel_sorted - 1) * discounts  # per-rank gain * discount
    dcg = contribs[:k].sum()
    idcg = idcg_at_k_numpy(y_true_tie[None, :], k=k)[0]
    return float(dcg / idcg) if idcg > 0 else 0.0

ndcg_best = ndcg_from_order(order_best)
ndcg_worst = ndcg_from_order(order_worst)

dcg_expected = dcg_at_k_expected_ties_1d(y_true_tie, y_score_tie, k=k)
idcg = idcg_at_k_numpy(y_true_tie[None, :], k=k)[0]
ndcg_expected = dcg_expected / idcg if idcg > 0 else 0.0

print(f"NDCG@{k} with stable tie-breaking:       {ndcg_simple:.4f}")
print(f"NDCG@{k} best-case within ties:          {ndcg_best:.4f}")
print(f"NDCG@{k} worst-case within ties:         {ndcg_worst:.4f}")
print(f"NDCG@{k} expected under random ties:     {ndcg_expected:.4f}")

# sklearn again uses linear gain, so these values differ from the
# exponential-gain numbers printed above
sk_tie_aware = ndcg_score(y_true_tie[None, :], y_score_tie[None, :], k=k, ignore_ties=False)
sk_ignore_ties = ndcg_score(y_true_tie[None, :], y_score_tie[None, :], k=k, ignore_ties=True)
print(f"sklearn ndcg_score@{k} (ignore_ties=False): {sk_tie_aware:.4f}")
print(f"sklearn ndcg_score@{k} (ignore_ties=True):  {sk_ignore_ties:.4f}")

fig = px.bar(
    x=["worst", "expected", "deterministic", "best"],
    y=[ndcg_worst, ndcg_expected, ndcg_simple, ndcg_best],
    title=f"Tie handling can change NDCG@{k}",
    labels={"x": "tie handling", "y": f"NDCG@{k}"},
)
fig.update_yaxes(range=[0, 1.05])
fig.show()
NDCG@5 with stable tie-breaking:       1.0000
NDCG@5 best-case within ties:          1.0000
NDCG@5 worst-case within ties:         0.9360
NDCG@5 expected under random ties:     0.9669
sklearn ndcg_score@5 (ignore_ties=False): 0.9579
sklearn ndcg_score@5 (ignore_ties=True):  0.9159

6) Using NDCG when optimizing a simple ranking model#

NDCG depends on sorting, so it is not a smooth, differentiable function of the model parameters. In practice, learning-to-rank systems usually optimize a surrogate loss and track NDCG for evaluation / model selection.
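You can see the problem directly: NDCG is piecewise constant in the scores, so any perturbation that does not swap a pair leaves it exactly unchanged (zero gradient almost everywhere). A quick sketch with illustrative values:

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 1, 2, 0]])
y_score = np.array([[2.0, 0.5, 1.0, 0.1]])

base = ndcg_score(y_true, y_score)
nudged = ndcg_score(y_true, y_score + 1e-6)       # shifting all scores: same order
perturbed = ndcg_score(y_true, y_score * 1.001)   # scaling: same order
print(base, nudged, perturbed)                    # identical: the metric is flat here
```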

We’ll compare two simple linear scoring approaches:

Pointwise regression (MSE)#

Treat each (query, doc) as an independent training example and predict the relevance grade:

\[ s_{qi} = x_{qi}^\top w \qquad \min_w \; \frac{1}{N}\sum_{q,i} (s_{qi} - y_{qi})^2 \]

Pairwise logistic loss (RankNet-style)#

For each query, form pairs \((i, j)\) where \(y_{qi} > y_{qj}\) and encourage \(s_{qi} > s_{qj}\):

\[ \min_w \; \mathbb{E}_{(q,i,j)}\big[\log(1 + \exp(-(s_{qi} - s_{qj})))\big] \]

Both produce scores you can rank; we’ll track mean NDCG@k during training.

# Synthetic learning-to-rank dataset
n_queries = 300
n_docs = 10
n_features = 6

X = rng.normal(size=(n_queries, n_docs, n_features))
w_star = rng.normal(size=(n_features,))

latent = X @ w_star + 0.2 * rng.normal(size=(n_queries, n_docs))

# Convert latent scores to graded relevance 0..3 *within each query*
q50 = np.quantile(latent, 0.50, axis=1, keepdims=True)
q75 = np.quantile(latent, 0.75, axis=1, keepdims=True)
q90 = np.quantile(latent, 0.90, axis=1, keepdims=True)

y_rel = np.zeros_like(latent, dtype=int)
y_rel += latent >= q50
y_rel += latent >= q75
y_rel += latent >= q90

# Train/validation split by query
perm = rng.permutation(n_queries)
n_train = int(0.8 * n_queries)
train_idx, val_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y_rel[train_idx]
X_val, y_val = X[val_idx], y_rel[val_idx]

k_eval = 5

def score_linear(X, w):
    # X: (n_queries, n_docs, n_features) -> scores: (n_queries, n_docs)
    return np.tensordot(X, w, axes=[2, 0])

w0 = rng.normal(scale=0.1, size=(n_features,))
baseline_val = mean_ndcg_at_k_numpy(y_val, score_linear(X_val, w0), k=k_eval)
print(f"Baseline mean NDCG@{k_eval} on val (random weights): {baseline_val:.4f}")
Baseline mean NDCG@5 on val (random weights): 0.4260
def train_pointwise_mse(
    X_train,
    y_train,
    X_val,
    y_val,
    k=5,
    lr=0.05,
    l2=1e-3,
    epochs=60,
):
    n_features = X_train.shape[2]
    w = rng.normal(scale=0.1, size=n_features)

    X_flat = X_train.reshape(-1, n_features)
    y_flat = y_train.reshape(-1).astype(float)

    history = []
    for epoch in range(epochs):
        preds = X_flat @ w
        err = preds - y_flat
        grad = (X_flat.T @ err) / y_flat.size + l2 * w
        w -= lr * grad

        train_ndcg = mean_ndcg_at_k_numpy(y_train, score_linear(X_train, w), k=k)
        val_ndcg = mean_ndcg_at_k_numpy(y_val, score_linear(X_val, w), k=k)
        mse = float(np.mean(err**2))
        history.append({"epoch": epoch, "train_ndcg": train_ndcg, "val_ndcg": val_ndcg, "mse": mse})
    return w, history


w_mse, hist_mse = train_pointwise_mse(X_train, y_train, X_val, y_val, k=k_eval)
print(f"Final val mean NDCG@{k_eval} (pointwise MSE): {hist_mse[-1]['val_ndcg']:.4f}")
Final val mean NDCG@5 (pointwise MSE): 0.9687
def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def _sample_pairs(y_query, max_pairs=40):
    # indices (i, j) such that y_i > y_j
    i_idx, j_idx = np.where(y_query[:, None] > y_query[None, :])
    if i_idx.size == 0:
        return np.empty((0,), dtype=int), np.empty((0,), dtype=int)
    if i_idx.size > max_pairs:
        sel = rng.choice(i_idx.size, size=max_pairs, replace=False)
        i_idx, j_idx = i_idx[sel], j_idx[sel]
    return i_idx.astype(int), j_idx.astype(int)


def make_pair_dataset(y_train, max_pairs_per_query=40):
    qs, is_, js = [], [], []
    for q in range(y_train.shape[0]):
        i_idx, j_idx = _sample_pairs(y_train[q], max_pairs=max_pairs_per_query)
        if i_idx.size:
            qs.append(np.full(i_idx.size, q, dtype=int))
            is_.append(i_idx)
            js.append(j_idx)
    if not qs:
        return np.empty((0,), dtype=int), np.empty((0,), dtype=int), np.empty((0,), dtype=int)
    return np.concatenate(qs), np.concatenate(is_), np.concatenate(js)


pair_q, pair_i, pair_j = make_pair_dataset(y_train, max_pairs_per_query=40)
print(f"Pairwise dataset size: {pair_q.size} pairs")


def train_pairwise_logistic(
    X_train,
    y_train,
    X_val,
    y_val,
    pair_q,
    pair_i,
    pair_j,
    k=5,
    lr=0.2,
    l2=1e-3,
    epochs=60,
):
    n_features = X_train.shape[2]
    w = rng.normal(scale=0.1, size=n_features)

    history = []
    for epoch in range(epochs):
        # x_diff: (n_pairs, n_features)
        x_diff = X_train[pair_q, pair_i] - X_train[pair_q, pair_j]
        d = x_diff @ w

        # log(1 + exp(-d)) in a stable way
        loss = float(np.mean(np.logaddexp(0.0, -d)) + 0.5 * l2 * np.sum(w**2))

        p = _sigmoid(d)
        grad = (x_diff * (p - 1.0)[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad

        train_ndcg = mean_ndcg_at_k_numpy(y_train, score_linear(X_train, w), k=k)
        val_ndcg = mean_ndcg_at_k_numpy(y_val, score_linear(X_val, w), k=k)
        history.append({"epoch": epoch, "train_ndcg": train_ndcg, "val_ndcg": val_ndcg, "loss": loss})

    return w, history


w_rank, hist_rank = train_pairwise_logistic(
    X_train,
    y_train,
    X_val,
    y_val,
    pair_q,
    pair_i,
    pair_j,
    k=k_eval,
)
print(f"Final val mean NDCG@{k_eval} (pairwise logistic): {hist_rank[-1]['val_ndcg']:.4f}")
Pairwise dataset size: 7920 pairs
Final val mean NDCG@5 (pairwise logistic): 0.9810
# Compare training curves (mean NDCG@k)
epochs = [h["epoch"] for h in hist_mse]
mse_train = [h["train_ndcg"] for h in hist_mse]
mse_val = [h["val_ndcg"] for h in hist_mse]

rank_train = [h["train_ndcg"] for h in hist_rank]
rank_val = [h["val_ndcg"] for h in hist_rank]

fig = go.Figure()

fig.add_trace(go.Scatter(x=epochs, y=mse_val, mode="lines", name="val NDCG (pointwise MSE)"))
fig.add_trace(go.Scatter(x=epochs, y=rank_val, mode="lines", name="val NDCG (pairwise logistic)"))

fig.add_trace(go.Scatter(x=epochs, y=mse_train, mode="lines", name="train NDCG (pointwise MSE)", line=dict(dash="dot")))
fig.add_trace(go.Scatter(x=epochs, y=rank_train, mode="lines", name="train NDCG (pairwise logistic)", line=dict(dash="dot")))

fig.add_hline(y=baseline_val, line_dash="dash", line_color="gray", annotation_text="random baseline (val)")

fig.update_layout(
    title=f"Learning-to-rank toy example: mean NDCG@{k_eval} during training",
    xaxis_title="epoch",
    yaxis_title=f"mean NDCG@{k_eval}",
)
fig.update_yaxes(range=[0, 1.05])
fig.show()

Pros / cons and when to use NDCG#

Pros#

  • Works with graded relevance (not just relevant/irrelevant)

  • Emphasizes the top of the ranking via the discount and cutoff \(k\)

  • Normalized to \([0, 1]\) so it’s comparable across queries

  • Depends only on ranking order (invariant to strictly monotonic score transforms, ignoring ties)

Cons#

  • Non-smooth due to sorting → not directly optimized by standard gradient descent (use surrogates or black-box optimization)

  • Requires a query → candidate set structure; not a drop-in metric for standard single-label classification

  • Tie handling, gain definition, and choice of \(k\) can change results

  • If \(\mathrm{IDCG}@k = 0\) the metric is defined by convention (commonly 0)

Good fits#

  • Search ranking (documents / products)

  • Recommender systems where you evaluate a ranked slate

  • Learning-to-rank model evaluation (pairwise/listwise methods)

References#

  • Järvelin, K. & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM TOIS.

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html

  • Burges, C. et al. (2005). Learning to Rank using Gradient Descent (RankNet).